DOCUMENT RESUME

ED 307 339                                              TM 013 527

AUTHOR        Linacre, John M.
TITLE         Objectivity for Judge-Intermediated Certification
              Examinations.
PUB DATE      Mar 89
NOTE          13p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (San
              Francisco, CA, March 27-31, 1989).
PUB TYPE      Reports - Evaluative/Feasibility (142);
              Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   *Evaluators; *Interrater Reliability; *Latent Trait
              Theory; *Licensing Examinations (Professions);
              Models; Testing Problems
IDENTIFIERS   *Fairness; *Objectivity; Rasch Model; Stochastic
              Approximation Method


ABSTRACT 

An accepted criterion for gauging the fairness of
examinees' scores, derived from judge-awarded ratings, has been the
size of the correlation between the judges and the inter-rater
reliability. Various means of achieving inter-rater reliability were
reviewed, and a model to measure inter-rater reliability is
forwarded. Both theoretical and practical considerations mandate that
perfect inter-rater reliability can never be achieved. A stochastic
element always remains. Objective measurement of examinees, freed
from the severity of the judges and the definition of the rating
scale, can be obtained by capitalizing on the stochastic nature of
ratings. The resulting measurement model is of the type developed by
Rasch. Examples of the model are provided. (TJH)
Abstract:

An accepted criterion for gauging the fairness of examinees' scores, derived
from judge-awarded ratings, has been the size of the correlation between
the judges, the inter-rater reliability. Both theoretical and
practical considerations mandate that perfect inter-rater reliability can
never be achieved. A stochastic element always remains. Objective
measurement of the examinees, freed from the severity of the judges and the
definition of the rating scale, can be obtained by capitalizing on the
stochastic nature of ratings. The resulting measurement model is of the type
developed by Rasch, and examples of the model are given.

Key-words:

I. Introduction:

The use of judges to assess the performance of examinees is viewed as
undesirable but, on occasion, necessary. The reasons for this reluctance to
use judges are clear. As Braun has written of essay examinations, "large
numbers of graders must be trained and supervised and the maintenance of
uniform standards across graders and over many days often becomes
problematic. An immediate consequence is that the reliability of the scores
is often substantially less than unity" (Braun 1988 p.1). The problem of
judge training is directly related to the problem of inter-rater
reliability. How the goal of a perfect test is perceived affects how judges
are trained towards that goal.

The quest for perfect reliability in judge ratings of examinees is pervasive
in the literature of judging and has motivated the implementation of
techniques to improve judging quality, such as clearer definition of the
categories of rating scales and more precise instructions to the judges.
But, like the search for El Dorado, the quest for reliability is ultimately
doomed to failure. This does not mean, however, that the ratings of judged
performances cannot be just as trustworthy as the responses to
multiple-choice questions. What it does mean is that emphasis must be
placed, not on numerical agreement between the ratings of the different
judges (reliability), but on agreement between the intentions of the judges.
This change of emphasis enables the construction of measures for the
examinees which are independent, in meaning, of the particular judges who
rated each performance and so are, colloquially speaking, "fair", or
technically speaking, "objective".



II. Why is the quest for reliability doomed to failure?

The ideal judging situation would appear to be that in which all judges agree
on every rating of examinees that they share in common. An example of this
is shown in Figure 1. Examinees have been given the same ratings by two
independent judges. For the purposes of this immediate discussion, the
rating scale is assumed to be an equal interval scale, so that the usual
arithmetic operations can be performed on it. This is an assumption in
most analyses of rating scales, such as Guilford (1954 pp. 278-301).



                    Examinees
             1    2    3    4    5    6

Judge A      4    3    4    2    1    3
Judge B      4    3    4    2    1    3

Figure 1. Perfect agreement in judges' ratings of six examinees on some
task. The rating scale has 5 categories in ascending order of performance
level, 0 to 4.



Whether perfect agreement is, in fact, the ideal has been questioned by a
number of researchers. On the one hand, from the empirical viewpoint, "when
two examiners award different marks, the average is more likely to be
correct, or nearly correct, than it is when they award the same mark" (Harper
1976 p.262). On the other hand, from the theoretical viewpoint, "it is
usually required to have two or more raters who are trained to agree on
independent ratings of the same performance. It is suggested that such a
requirement may produce a paradox of attenuation associated with item
analysis, in which too high a correlation between items, while enhancing
reliability, decreases validity" (Andrich 1984).



                    Examinees
             1    2    3    4    5    6

Judge A      4    3    4    2    1    3
Judge C      3    2    3    1    0    2

Figure 2. Perfect inter-rater reliability in the ratings of six
examinees on some task. Judge C is one score-point more severe than Judge A.



The topic of complete agreement, however, is a moot point, because it cannot 
be expected to occur in any large-scale examination situation. Given that 
raters do differ, let us consider the question of perfect judge reliability. 
Figure 2 gives an example of this, under the same conditions and with the 
same six examinees as Figure 1. 

It can be seen that the ratings given by the judges are perfectly correlated,
and so perfectly reliable according to indices based on product-moment
correlation. However, according to indices based on nominal agreements in
the ratings, such as Cohen's (1960) Kappa, there is no agreement at all. It
can be seen that these judges do agree on the rank order of the examinees, so
that to report that they have no agreement at all is clearly misleading.
Accordingly we will consider these judges to have perfect inter-rater
reliability, but we discern that Judge A is one score-point more lenient than
Judge C. In this paper, we will consider judges to differ only in
leniency/severity because this is usually a large component of the variance
of the ratings, and, as Braun suggests as a result of his study, "adjusting
scores for the differences between [judges] should improve the reliability"
(Braun 1988 p.8).
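
The contrast between the two kinds of index can be checked directly on the ratings of Judges A and C in Figure 2. The following is a minimal sketch in Python; the helper names `pearson` and `cohen_kappa` are illustrative and are not part of the paper.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Product-moment correlation of two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cohen_kappa(x, y):
    """Cohen's (1960) kappa: agreement in categories, corrected for chance."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    expected = sum(cx[k] * cy[k] for k in cx) / n ** 2
    return (observed - expected) / (1 - expected)

judge_a = [4, 3, 4, 2, 1, 3]   # ratings from Figure 2
judge_c = [3, 2, 3, 1, 0, 2]   # ratings from Figure 2

r = pearson(judge_a, judge_c)
kappa = cohen_kappa(judge_a, judge_c)
exact_matches = sum(a == c for a, c in zip(judge_a, judge_c))
print(round(r, 3), exact_matches, round(kappa, 3))
```

The correlation is 1 while the two judges never award the same category, so kappa falls below zero: the same pair of judges looks perfect on one index and worse than chance on the other.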

The possibility of perfect inter-rater reliability of judges whose severity
differs by one score point raises the question of what would be the ratings
given by a perfectly reliable judge who is only 0.5 score points more severe
than Judge A. Figure 3 makes one suggestion. Judge D must accommodate his
behavior to the predefined rating scale, and thus his 0.5 point difference is
expressed by awarding half the examinees a rating one point lower than Judge
A, and the other half the same rating as Judge A. Consequently, two judges,
who, in intention, have perfect reliability, are observed to have a
correlation coefficient of 0.895.

Let us say that, as a result of some analysis, Judge D has been determined to
be 0.5 score points more severe than Judge A, and a correction of 0.5 points
is made in all Judge D's ratings. The outcome is shown in Figure 4. The
inter-rater reliability has not changed, nor, after rounding to the nearest
integer category, has the nominal agreement in categories. The correction
for judge severity has made no improvement in the reliability of this set of
ratings.
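
That a constant correction leaves a product-moment correlation untouched can be confirmed numerically with the ratings of Figures 3 and 4. This is a small sketch; the `pearson` helper is illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation of two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

judge_a = [4, 3, 4, 2, 1, 3]                      # Figures 3 and 4
judge_d = [3, 3, 3, 2, 0, 3]                      # Figure 3, as awarded
judge_d_corrected = [d + 0.5 for d in judge_d]    # Figure 4, after correction

r_raw = pearson(judge_a, judge_d)
r_corrected = pearson(judge_a, judge_d_corrected)
print(round(r_raw, 3), round(r_corrected, 3))
```

Both values come out at about 0.895, because adding the same amount to every rating shifts the mean but not the deviations from it.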



                    Examinees
             1    2    3    4    5    6

Judge A      4    3    4    2    1    3
Judge D      3    3    3    2    0    3

Figure 3. Ratings given by two judges when one judge is 0.5 score-points
more lenient than the other.



                    Examinees
             1     2     3     4     5     6

Judge A      4     3     4     2     1     3
Judge D      3.5   3.5   3.5   2.5   0.5   3.5

Figure 4. Ratings given by two judges when a judge severity of 0.5
score-points has been corrected for.

Another judge, E, who is also 0.5 score points more severe than Judge A, now
awards his ratings and these are shown in Figure 5. Again, Judge E expressed
his severity by awarding half the examinees a rating one score point below
that of Judge A, and the other half the same rating as Judge A. Again their
agreement, in intention, is perfect but their correlation coefficient is
0.895. Now compare Judges D and E, who are both 0.5 score points more severe
than Judge A. Their ratings are shown in Figure 6.

We know that both Judge D and Judge E have the same degree of severity and
agree, in intention, as to the standard of performance of the examinees, but
the constraints of the rating scale have caused them to express it
differently. Judge D and Judge E are reported to have a correlation
coefficient of 0.645, but, from the point of view of an examining board, even
this understates the problem. If a rating of 3 or 4 constituted a pass, and
0, 1 or 2 constituted a failure, then Judge D passes 4 and fails 2, and Judge
E passes 2 and fails 4. Judge E's two passes, however, are reported as
maximum scores (4). In traditional analysis, Judge D would be reported as
being more severe but generally in agreement with Judge A, but Judge E would
be reported as having a marked inversion of "central tendency".
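
The comparison of Judges D and E can be reproduced from the ratings in Figure 6. The sketch below uses an illustrative `pearson` helper and a constant `PASS_MARK` that encodes the pass rule stated in the text (3 or 4 is a pass); neither name comes from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation of two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

PASS_MARK = 3                  # ratings of 3 or 4 pass; 0, 1 or 2 fail

judge_a = [4, 3, 4, 2, 1, 3]   # Figure 5
judge_d = [3, 3, 3, 2, 0, 3]   # Figure 6
judge_e = [4, 2, 4, 1, 1, 2]   # Figures 5 and 6

r_de = pearson(judge_d, judge_e)
r_ae = pearson(judge_a, judge_e)
passes_d = sum(rating >= PASS_MARK for rating in judge_d)
passes_e = sum(rating >= PASS_MARK for rating in judge_e)
print(round(r_de, 3), passes_d, passes_e, round(r_ae, 3))
```

The correlation between D and E is about 0.645, with 4 passes for Judge D against 2 for Judge E. Applied to Judges A and E in Figure 5 as reproduced here, the same helper returns about 0.918.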

This paradox of lack of reliability has been presented in terms of one judge 
being 0.5 score points more severe than another. The very same situation 
arises, however, when one examinee is 0.5 score points less able than 
another, even when there is a perfect correlation of judge intentions.
Indeed, since the process of measurement is based on the concept that there 
is a continuum of examinee performance, examinees will always be found who 
perform, for any particular judge, at or near the transition between adjacent 
categories. 



                    Examinees
             1    2    3    4    5    6

Judge A      4    3    4    2    1    3
Judge E      4    2    4    1    1    2

Figure 5. Further example of ratings given by two judges when one judge is
0.5 score-points more severe than the other.



                    Examinees
             1    2    3    4    5    6

Judge D      3    3    3    2    0    3
Judge E      4    2    4    1    1    2

Figure 6. Comparison of the ratings awarded by two judges of equal severity.

Even given ideal judges, it is clear that a performance at a level of 2.5
score-points will be awarded 3 points or 2 points with approximately equal
frequency. However, with real judges, however well-trained and experienced,
what will happen to a performance at a level of 2.49 score-points? It
cannot be expected that the judges will have such precise discrimination that
this performance will always be awarded a rating of 2, but never a rating of
3. Indeed we expect a greater frequency of 2's than 3's, but not a very
great difference. By extension of the same argument, the situation in Figure
7 can be expected to result. What appeared to be a deterministic decision by
judges is revealed to be a probabilistic one. This stochastic element in
rating is what dooms the quest for perfect inter-rater reliability.
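
The argument can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the chance of a 3 rather than a 2 follows a logistic curve centred on the 2.5 boundary; the function name `prob_of_3` and the `spread` value are hypothetical and are not taken from the paper.

```python
from math import exp

def prob_of_3(ability, boundary=2.5, spread=0.25):
    """Probability of a 3 rather than a 2, assuming a logistic curve.

    `spread` (in score-points) is an assumed value that controls how
    sharply judges discriminate near the category boundary.
    """
    return 1.0 / (1.0 + exp(-(ability - boundary) / spread))

for ability in (2.0, 2.49, 2.5, 3.0):
    p3 = prob_of_3(ability)
    print(f"ability {ability:.2f}: P(2) = {1 - p3:.3f}, P(3) = {p3:.3f}")
```

Under these assumptions a performance at 2.49 receives a 2 only slightly more often than a 3, and the two ratings become equally likely exactly at 2.5.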

[Figure: curves labelled "Probability of a 2" and "Probability of a 3",
plotted against examinee ability (represented by score-points), with 2.5
marked on the ability axis.]

Figure 7. Probability of rating expected to be given to an examinee of given
ability.



Definition:  Unacceptable | Defective | Low | Acceptable | Superior
Category:         0       |     1     |  2  |     3      |    4

Less Able                                                 More Able
                 Linear Performance Level Scale

Figure 8. Relationship between category numbers and performance level.

III. Moving from reliability to objectivity: the nature of the rating scale.

The fact that the scale we have been discussing has categories numbered 0,1, 
2,3,4 does not force the categories to represent equal increments in 
performance. In the next judging session, the examining board might decide 
to introduce a new category between previous categories 2 and 3, and then 
renumber the scale as 1,2,3,4,5,6. If the old scale were linear, then the 
new scale isn't, and vice-versa. The numbers assigned to categories are 
merely a convenience of labelling which enables an ordering of performance 
levels, but they are not a direct expression of the amount of performance 
each category represents. Figure 8 illustrates this in the context of a 
realistic rating scale. Category 0, "Unacceptable", represents all levels of 
performance below category 1, an infinite range, and similarly category 4, 
"Superior", represents all levels of performance above category 3.
Consequently, no matter how the categories are defined, it is impossible for
all of them to represent equal ranges of performance, and thus it is
impossible for a numerical scale made up of category numbers to be linear. 

We have seen illustrated in Figure 7 that there is a stochastic element to 
the awarding of categories. If we extend this finding to Figure 8, it can be
seen that category 2 represents such a narrow range of real performance, that
a "true" or latent performance level on the threshold between a 1 and a 2 
could well be rated not only as a 1 or a 2, but even as a 3. In fact, the 
stochastic nature of judge rating implies that whatever the examinee's 
performance level, there is some probability that the judge may award a 
rating in any of the categories, though, for a well designed scale, the 
category nearest the examinee's performance level has the highest 
probability. Figure 9 depicts what occurs. This stochastic behavior is what 
has caused the quest for reliability to fail, but it is this very behavior 
which provides the key to objectivity and so fairness. 




[Figure: horizontal axis labelled "Linear Performance Level Scale".]

Figure 9. The probabilistic nature of the awarding of ratings by judges.



IV. The aim of the judging process. 

For the examination described here, the ultimate goal of the judging process, 
from the viewpoint of an examining board, is not to determine some "true" 
rating on which ideal judges would agree, but rather to estimate the 
examinee's latent ability level, of which each judge's rating is a 
manifestation. This is the very essence of objectivity. In order to 
supersede the local particularities of the judging situation, each judge must 
be treated as though he has a unique severity, each examinee as though he 
has a unique ability, and each rating scale as though it has a unique 
formulation. This means that many interesting aspects of behavior must be 
regarded as incidental, and each rating considered to be the probabilistic 
result of only three interacting components: the severity of a judge, the 
ability of an examinee and the structure of the rating scale. With these 
assumptions, it is possible to obtain the outcome that the examining board 
desires, which is an estimate of the ability of each examinee, freed from the 
level of severity of the particular judges who happened to rate his 
performance and also freed from the arbitrary manner in which the categories 
of the rating scale have been defined. The more that incidental aspects of 
behavior are in evidence in the ratings, the more uncertainty there is in the
estimates of the examinees' abilities, and the less confidence there is that
the aim of the judging process has been realized in the judges' ratings.



Accurate measurement thus depends not on finding the one "ideal" judge but on
discerning the intentions of the actual judges through the way in which they
have replicated their behavior in all the ratings each has made.
Consequently, judges cannot be assumed to be replications of one another, and 
to assert that examinees are sampled from a normal population is to 
predetermine the results of the examination. This indicates that judge 
training would be more productive if it were aimed towards fostering 
consistent behavior representative of the judge's intentions, rather than 
towards making the judge replicate a notional "ideal" judge. 



                       Judge F
Categories:          2        3

Judge G     2       N22      N32
            3       N23      N33

Figure 10. Hypothetical counts of the awarding of categories by replications
of two judges and the same examinee. Only instances when both judges have
used categories 2 and/or 3 have been recorded.



                       Judge F
Categories:          2           3

Judge G     2     PG2*PF2     PG2*PF3
            3     PG3*PF2     PG3*PF3

Figure 11. Probabilities of the awarding of categories by two judges. Only
instances when both judges have used categories 2 and/or 3 have been
recorded. PF2 is the probability of Judge F awarding a 2, and PG2 is the
probability of Judge G awarding a 2, and similarly for PF3 and PG3.



V. The probabilistic nature of ratings as a means to objective measurement. 

It is the overlapping of category probabilities in Figure 9 that provides the
means for objective measurement. For a judge of given severity, each
performance level is determined uniquely by the probabilities of awarding the
different categories. These probabilities are realized empirically in the
data by the frequencies with which each judge awards each category. In
principle, if two judges were to rate innumerable replications of the same 
examinee, the relative frequency with which they award the different 
categories would reveal their relative severity precisely and uniquely. 
Figure 10 summarises an example of the counts of the ratings after a number 
of such replications. 

However, the examining board's intention is that the judging situation be
stable, so that, as the number of replications increases, the frequencies in
Figure 10, when divided by the number of replications, approach the
underlying probabilities, PG2 that Judge G awards a 2 and PG3 that he awards
a 3, and similarly PF2 and PF3 for Judge F. Since the two judges rate
independently, the probabilities of their pairs of ratings are as shown in
Figure 11.

The comparison of the severities of Judges F and G is based on the
information in Figure 11 which contrasts their behavior. This is expressed
by the probabilities in the cells which reflect their disagreement, the upper
right and bottom left cells. The quantitative comparison itself is formed by
the ratio (PG2*PF3)/(PG3*PF2). The top left and bottom right cells of Figure
11 give the probabilities of judge agreement, but they do not directly allow
us to contrast the judges' behavior, and so do not reveal their relative
severity. The probabilities themselves cannot be observed directly, even in
theory, but they underlie the frequencies observed in the corresponding cells
of Figure 10. Thus the quantitative comparison of Judges F and G is
estimated by the ratio (N32/N23) of Figure 10.
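
How the count ratio N32/N23 approaches the probability ratio (PG2*PF3)/(PG3*PF2) can be illustrated by simulating independent replications. This is only a sketch: the probabilities `pf3` and `pg3`, the number of replications and the function name `simulate_counts` are hypothetical choices, not values from the paper.

```python
import random

def simulate_counts(pf3, pg3, replications, seed=0):
    """Count paired ratings when each judge independently awards a 2 or a 3.

    pf3 and pg3 are assumed probabilities that Judges F and G award a 3,
    given that a 2 or a 3 is awarded. Keys are (F's rating, G's rating).
    """
    rng = random.Random(seed)
    counts = {(2, 2): 0, (3, 2): 0, (2, 3): 0, (3, 3): 0}
    for _ in range(replications):
        f = 3 if rng.random() < pf3 else 2
        g = 3 if rng.random() < pg3 else 2
        counts[(f, g)] += 1
    return counts

pf3, pg3 = 0.6, 0.4   # hypothetical values in which Judge G is the more severe
counts = simulate_counts(pf3, pg3, replications=200_000)

# N32: F awards a 3 while G awards a 2; N23: F awards a 2 while G awards a 3.
estimated_ratio = counts[(3, 2)] / counts[(2, 3)]
true_ratio = ((1 - pg3) * pf3) / (pg3 * (1 - pf3))   # (PG2*PF3)/(PG3*PF2)
print(round(estimated_ratio, 2), round(true_ratio, 2))
```

Only the two disagreement cells enter the estimate; the agreement counts N22 and N33 are tallied but play no part in the comparison.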

This theoretical pair-wise comparison of judges could be continued across all
pairs of judges, all pairs of categories, and all examinees. Further, a
comparison of the abilities of each pair of examinees could be made in the same
way by considering the ratings awarded by numerous replications of judges of
the same severity. In practice, however, the nature of the replications
contained in the judges' ratings is not of numerous repetitions of the
identical situation, but rather of numerous repetitions of the identical
process. Each repetition contains some different subset of the parameters of
the examination: judges' calibrations, examinee measures and rating scale
calibrations. For objectivity, the values of these parameters must be
regarded as fixed throughout the examination.

The nature of the rating scale, that is how far apart the category
boundaries are along the performance scale, is revealed by how the ratings
given by judges of particular severity to examinees of particular ability are 
distributed among the categories. The severity of each judge is determined 
by his perception of overall examinee performance as revealed in the 
distribution of the ratings he awarded. The performance level of each 
examinee is determined by means of the performance levels implied in the 
ratings given by judges, each with a particular level of severity. Thus each 
examinee is represented by one measure of ability, as the examining board 
intended, and each judge by one measure of severity. Parameters are thus 
estimated jointly, but, once this process has been successfully accomplished, 
it no longer matters which judges rated which examinees. 



VI. The objective measurement model. 

The comparison of judges, presented in Figure 11, leads directly to the model
necessary and sufficient for objective measurement of the examinees, which 
must obtain if the empirical set of ratings is to be statistically coherent, 
with all the parameters expressed on a linear scale. This measurement model, 
which is necessary for objectivity, was first proposed by Georg Rasch 
(1960/1980), and its extension to rating scales is given in Wright and 
Masters (1982). It has been further generalised to many-faceted judging 
situations (Linacre 1987). 
The measurement model applicable to the examination described here is:

    log(Pnjk/Pnjk-1) = Bn - Cj - Fk

where

    Pnjk    is the probability of examinee n being awarded by
            judge j a rating in category k
    Pnjk-1  is the probability of examinee n being awarded by
            judge j a rating in category k-1
    Bn      is the ability of examinee n (the performance level
            on a linear scale which the examining board really wants)
    Cj      is the severity of judge j
    Fk      is the difficulty of the step from category k-1 to
            category k of the rating scale.

This model takes advantage of the fact that there is some probability of any 
judge rating any examinee in any category, as presented in Figure 9. This 
model, however, dictates the precise form of the probability curves necessary
for objective measurement. The modelled forms of the curve must, and do, 
occur in practice when the judging process is truly in accord with the 
examining board's intentions that each examinee can be characterised by one 
ability parameter. 
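
The probability curves implied by the model equation can be computed directly. The sketch below assumes hypothetical values for Bn, Cj and Fk, and the function name `category_probabilities` is illustrative. It also checks that the disagreement ratio of Figure 11 equals the difference in severity between two judges, which follows from subtracting the model equation for one judge from that for the other.

```python
from math import exp, log

def category_probabilities(ability, severity, steps):
    """Category probabilities under log(P_k / P_{k-1}) = Bn - Cj - Fk.

    steps[k-1] holds Fk, the difficulty of the step from k-1 to k, so a
    scale with categories 0..4 needs four step values.
    """
    log_numerators = [0.0]                      # category 0 is the reference
    for step in steps:
        log_numerators.append(log_numerators[-1] + ability - severity - step)
    numerators = [exp(v) for v in log_numerators]
    total = sum(numerators)
    return [v / total for v in numerators]

steps = [-2.0, -0.5, 0.5, 2.0]   # hypothetical step difficulties, in logits
p_f = category_probabilities(ability=1.0, severity=-0.3, steps=steps)  # Judge F
p_g = category_probabilities(ability=1.0, severity=0.4, steps=steps)   # Judge G

# log[(PG2*PF3)/(PG3*PF2)] should equal severity of G minus severity of F.
log_ratio = log((p_g[2] * p_f[3]) / (p_g[3] * p_f[2]))
print([round(p, 3) for p in p_f], round(log_ratio, 3))
```

With these assumed values the log ratio is 0.7 logits, the difference between the two severities, whatever the examinee's ability or the step difficulties.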

In the model equation, the logarithm of the ratios of probabilities, the
"log-odds", is used to determine the ability of each examinee, the severity
of each judge and the structure of the rating scale. The estimated measures
and calibrations are obtained by fitting the actual ratings given by the
judges into the framework determined by the model. This can be done by means
of the technique of maximum likelihood which yields the estimated measures
and calibrations which are most likely to produce the ratings that were
awarded. These estimates are in "log-odds units" (logits) which form an
interval scale and are equivalent to the inches or meters of physical
science. The estimation procedure also provides standard errors, which show
how accurately the measures have been determined. A further vital outcome
is the fit statistics, which indicate whether the process of measuring the
examinee's ability has been successful.
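
A deliberately reduced sketch of maximum likelihood estimation follows. It estimates the ability of a single examinee by Newton-Raphson while treating the judge severities and step difficulties as already known, which is a simplification of the joint estimation described in the text. All numerical values and function names are hypothetical.

```python
from math import exp, sqrt

def category_probabilities(ability, severity, steps):
    """Category probabilities under log(P_k / P_{k-1}) = Bn - Cj - Fk."""
    log_numerators = [0.0]
    for step in steps:
        log_numerators.append(log_numerators[-1] + ability - severity - step)
    numerators = [exp(v) for v in log_numerators]
    total = sum(numerators)
    return [v / total for v in numerators]

def estimate_ability(ratings, severities, steps, iterations=50):
    """Maximum-likelihood ability of one examinee, with its standard error.

    The gradient of the log-likelihood is (observed - expected) total rating
    and the information is the summed variance of the modelled ratings.
    """
    ability = 0.0
    for _ in range(iterations):
        expected_total, information = 0.0, 0.0
        for severity in severities:
            probs = category_probabilities(ability, severity, steps)
            mean = sum(k * p for k, p in enumerate(probs))
            expected_total += mean
            information += sum((k - mean) ** 2 * p for k, p in enumerate(probs))
        change = (sum(ratings) - expected_total) / information
        ability += change
        if abs(change) < 1e-10:
            break
    return ability, 1.0 / sqrt(information)

steps = [-2.0, -0.5, 0.5, 2.0]     # assumed step difficulties, in logits
severities = [-0.5, 0.0, 0.5]      # assumed severities of three judges
ratings = [3, 2, 3]                # ratings those judges awarded

ability, standard_error = estimate_ability(ratings, severities, steps)
print(round(ability, 2), round(standard_error, 2))
```

The estimate is the ability at which the total rating expected under the model equals the total actually awarded. An examinee rated in the top or bottom category by every judge has no finite estimate under this procedure.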

When the entire set of ratings is statistically coherent, then each rating 
cooperates in the simultaneous estimation of the parameters of the three 
facets within one overall framework, provided that sufficient overlap has 
been built into the judging plan to allow all judges and examinees to be 
integrated into a single global frame of reference. Neither complete data 
(every judge rating every examinee) nor complex judging plans (e.g. partial 
incomplete block designs) are required. 
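
One simple way to check a judging plan for overlap is to test whether the judges and examinees form a single connected network. The sketch below treats connectivity as a stand-in for "sufficient overlap", which is an interpretation rather than a criterion stated in the paper; the function name `plan_is_connected` and the example plans are hypothetical.

```python
from collections import defaultdict

def plan_is_connected(plan):
    """Return True if every judge and examinee is linked into one network.

    `plan` is a list of (judge, examinee) pairs, one pair per rating.
    """
    neighbours = defaultdict(set)
    for judge, examinee in plan:
        j, e = ("judge", judge), ("examinee", examinee)
        neighbours[j].add(e)
        neighbours[e].add(j)
    if not neighbours:
        return False
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:                         # depth-first walk over the links
        node = stack.pop()
        for other in neighbours[node] - seen:
            seen.add(other)
            stack.append(other)
    return len(seen) == len(neighbours)

# Judges A, B, C share examinees in a chain, so all are comparable.
linked = [("A", 1), ("B", 1), ("B", 2), ("C", 2), ("C", 3), ("A", 3)]
# Judges A, B never share an examinee, directly or indirectly, with C, D.
split = [("A", 1), ("B", 1), ("C", 2), ("D", 2)]

print(plan_is_connected(linked), plan_is_connected(split))
```

In the second plan the two pairs of judges rate disjoint groups of examinees, so their severities could not be placed in one frame of reference.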



VII. Objective measurement for more complex judging situations. 

The discussion in this paper has presented a simplified problem, as one 
example of many-faceted Rasch measurement. This theory can be applied to 
measurement in more complex situations. In principle, each new facet of a 
judging situation introduces into the general model equation its own set of 
parameters. There is no requirement that every rating be formed out of the
same combination of facets, only that the ratings form part of one overall 
design. 



An example would be a certification examination in which each examinee is
rated on one skill item by two judges from a panel of judges. This skill
item is rated on a 5 point rating scale, and could be the performance of some
laboratory procedure. Each candidate is also rated by another judge as to
his success or failure on a second skill item, which could be the accuracy of
his report of the outcome of the laboratory procedure. The judges are
rotated so that each examinee is rated by three different judges, and each
judge rates both skill items and is also paired with every other judge over
the course of the judging session.

The measurement model for the first skill item, with 5 categories, thus
becomes

    log(Pn1jk/Pn1jk-1) = Bn - D1 - Cj - F1k

where

    Pn1jk    is the probability of examinee n's performance on the
             first skill item being awarded by judge j a rating of k
    Pn1jk-1  is the probability of examinee n being awarded k-1
    Bn       is the ability of examinee n
    D1       is the overall difficulty of skill item 1
    Cj       is the severity of judge j
    F1k      is the difficulty of the step from category k-1 to
             category k for skill item 1.

This first model is invoked simultaneously with the measurement model
for the second dichotomous skill item:

    log(Pn2j1/Pn2j0) = Bn - D2 - Cj

where

    Pn2j1  is the probability of examinee n's performance on the
           second skill item being rated by judge j as successful
    Pn2j0  is the probability of examinee n being rated as failed,
           so that Pn2j1 + Pn2j0 = 1
    Bn     is the ability of examinee n (same as for item 1)
    D2     is the difficulty of skill item 2
    Cj     is the severity of judge j (same as for item 1)
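
The two models can be evaluated with one routine, since the dichotomous item behaves like a rating scale with a single step. The sketch below is illustrative only: the values chosen for Bn, Cj, D1, D2 and F1k are hypothetical, as is the function name `rating_probabilities`.

```python
from math import exp, log

def rating_probabilities(ability, item_difficulty, severity, steps):
    """P(k) under log(P_k / P_{k-1}) = Bn - D - Cj - Fk, k = 1..len(steps)."""
    log_numerators = [0.0]                      # lowest category as reference
    for step in steps:
        log_numerators.append(
            log_numerators[-1] + ability - item_difficulty - severity - step)
    numerators = [exp(v) for v in log_numerators]
    total = sum(numerators)
    return [v / total for v in numerators]

ability, severity = 0.8, 0.2             # Bn and Cj, shared by both items
d1, f1 = -0.4, [-1.5, -0.5, 0.5, 1.5]    # D1 and F1k for the 5-category item
d2 = 0.1                                 # D2 for the success/failure item

item1 = rating_probabilities(ability, d1, severity, f1)
# The dichotomous item is the same model with a single step fixed at zero.
item2 = rating_probabilities(ability, d2, severity, [0.0])
log_odds_item2 = log(item2[1] / item2[0])   # equals Bn - D2 - Cj

print([round(p, 3) for p in item1], round(item2[1], 3), round(log_odds_item2, 3))
```

Because the same Bn and Cj appear in both calls, ratings on either item contribute to the same examinee measure and the same judge calibration.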

The simultaneous application of these models would produce objective, linear, 
measures for each examinee, with their associated standard errors. The 
self-consistent behavior of the judges as well as the overall success of
the measurement process could be verified by reference to well-defined fit 
statistics, in spite of the fact that this design includes very little 
duplicate judging and would not be amenable to most forms of analysis. 

In situations in which examinees are rated on a number of items, duplicate
ratings can be avoided entirely. This can be done by arranging for each
component part of an examinee's performance to be rated by a different judge
of a judging team, with a judging plan which rotates judges across skill
items and into different judging teams during the judging session.



VIII. Conclusion.

Attempting to determine an examinee's performance in terms of judge agreement
on a "true" rating is seen to be a hopeless task. Even with judges of equal
severity, their agreement, and hence reliability, will be affected by how
close an examinee's performance is to a category boundary. Further, the
arbitrary definition of the categories makes any attempt to use their numbers
as the basis of arithmetical operations an exercise in dubious approximations.

Objectivity in examinations is obtained through a consideration of intention.
Are examinee measures to be based on serendipitous numerical agreement in the
ratings given by the judges, or are the examinee measures to be determined
from the intentions of the judges as revealed through a consideration of the
information contained in all their ratings? If the intention of the
examination board is to determine a measure for each examinee on an interval
scale amenable to arithmetical manipulation and generalizable beyond the
particular details of the judging situation, then the many-faceted Rasch
measurement model is the model required for such objectivity.